Skip to content

GH-3449: Fix BinaryStats int overflow#3448

Merged
wgtmac merged 2 commits intoapache:masterfrom
Jiayi-Wang-db:fix-binary-stats-overflow
Mar 18, 2026
Merged

GH-3449: Fix BinaryStats int overflow#3448
wgtmac merged 2 commits intoapache:masterfrom
Jiayi-Wang-db:fix-binary-stats-overflow

Conversation

@Jiayi-Wang-db
Copy link
Contributor

@Jiayi-Wang-db Jiayi-Wang-db commented Mar 16, 2026

Rationale for this change

BinaryStatistics.isSmallerThan() computes min.length() + max.length() as int + int. When a binary column contains values >= 2^30 bytes (1 GB), this sum overflows to a negative number, which is then incorrectly evaluated as smaller than MAX_STATS_SIZE (4096). This causes the full multi-GB min/max statistics to be serialized into the Thrift footer instead of being dropped, resulting in a corrupted Parquet file whose footer exceeds the 4-byte length field and cannot be read back.

What changes are included in this PR?

Promote int arithmetic to long before addition in BinaryStatistics.isSmallerThan() and isSmallerThanWithTruncation() to prevent integer overflow when binary values are very large.

Are these changes tested?

Yes. Added a unit test in TestStatistics that creates a 1 GB binary value and asserts isSmallerThan(4096) correctly returns false.

Are there any user-facing changes?

Parquet files with very large binary values (>= 1 GB) will no longer have corrupted footers. The oversized statistics are now correctly dropped, producing a valid and readable file.

Closes #3449

@Jiayi-Wang-db Jiayi-Wang-db changed the title Fix BinaryStats int overflow GH-3449: Fix BinaryStats int overflow Mar 16, 2026
Use Binary.fromConstantByteArray with a fake large length instead of
allocating a real 1 GB byte array. The test only exercises length()
arithmetic in isSmallerThan, so the backing bytes are never accessed.
@wgtmac wgtmac merged commit 689547c into apache:master Mar 18, 2026
5 checks passed
@wgtmac
Copy link
Member

wgtmac commented Mar 18, 2026

Thanks for the fix, @Jiayi-Wang-db!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix BinaryStats int overflow

2 participants